Freedom to act (Mushin)

Ben Whalley, Paul Sharpe, Sonja Heintz

Overview

We think the analogy to using R is clear:

  • If you are anxious, stressed or avoidant you will be distracted
  • Getting confident with the basics makes more complex techniques possible

TODO: replace with feelgood video

In this session we cover:

  • Loading data from files
  • Using simple techniques to answer research questions with data
  • Saving intermediate steps using variables

Principles/ideas

  • Using data to answer questions
  • Precision and literal-mindedness of R
  • Paths and directories

Storing data in variables

TODO: replace with video

Video summary:

  • In R, a variable is the name for a container which stores data.
  • We make variables using the assignment operator, which looks like this: <-.
  • Values on the right hand side of <- are stored in the variable on the left hand side.
  • Variables that you create are stored in the Global Environment, which you can see using the Environment pane.
# calculate 40 + 2 and assign the result to a variable
meaning_of_life <- 40 + 2
# print variable
meaning_of_life
[1] 42

As we work, it’s useful to be able to save the results of the code we write.

As one example, we might have a dataset with multiple columns, each holding participants’ answers to an individual questionnaire item. We might want to calculate a new column (maybe an average of each person’s scores on all of the questions) and keep track of this so we can use it in later calculations.

Alternatively, we might want to save the result of a specific calculation and use it later on.

To do this we can create a variable.

A variable is just a container to store data in. To make variables we use the assignment operator, which looks like this: <-

That is, like an arrow that points to the left. This is a reminder that the results of the calculation on the right hand side will be assigned (stored) in the variable on the left hand side.

The code in this chunk runs the calculation on the right hand side of the assignment operator, 40 + 2, and assigns the result to a new variable named meaning_of_life. The output of the chunk is 42, the value of meaning_of_life.

Give your variables short names which describe the data they contain. Use the underscore _ if you need to use more than one word e.g. meaning_of_life.

You might wonder where these variables get saved. In most cases, variables you create are stored in what’s called the Global Environment. You can see them in the Environment pane in RStudio. Double-clicking on any variable there will show you what is stored inside the container.
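As a small illustration of the points above, the sketch below stores two values, reuses them in a later calculation, and then lists what is in the Global Environment. The variable names are invented for this example and are not part of the workbook.

```r
# store two values in variables (names invented for this example)
minutes_per_hour <- 60
hours_per_day <- 24

# reuse the stored values in a new calculation and store that result too
minutes_per_day <- minutes_per_hour * hours_per_day

# print the new variable
minutes_per_day

# list the names of the variables in the Global Environment
ls()
```

The names printed by ls() should match the ones you see in the Environment pane.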

Exercise 1

  1. Open session-2.rmd using the Files pane. This is the workbook you will be using in this session.
  2. Run the first chunk in the workbook.

The output should look like this:

Results of creating meaning_of_life variable

Your Environment pane should look like this:

Environment pane after creating variable

Exercise 2

  1. Create a level 3 markdown heading named “Exercise 2” in your workbook
  2. Create a new chunk beneath the heading
  3. Assign the results of the calculation 2 * 35 to the variable seventy
  4. Run the chunk

Your Environment should now look like this:

Environment pane after creating new variable

Exercise 3

  1. Create a level 3 markdown heading named “Exercise 3” in your workbook
  2. Use R to calculate your age in the year 2051.
  3. Save the result in a variable with a descriptive name.

Passing data to commands using the pipe %>%

TODO: replace with video

Video summary:

  • We pass data from one piece of code to another using the pipe command, which looks like this: %>%.
  • A pipeline is a sequence of two or more commands joined by %>%.
  • You can use the assignment operator to store the results of a pipeline in a variable.
# pipe mtcars into head()
mtcars %>% head()
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

# store first few rows of mtcars
mtcars_head <- mtcars %>%
  head()

Sometimes we need to link together multiple steps in our analysis.

For example, if we’re working with a big dataset we might want to select only some of the columns, then filter out some of the rows of data, and then finally calculate descriptive statistics.

We could do this by creating lots of variables, each one saving the results at each intermediate step. This can get confusing, though.

Instead we can use what’s known as a ‘pipe’ — it’s another way to link together multiple instructions.

The pipe sends data from one piece of code to another.

The pipe looks like this %>%.

In session 1, you used this command to “pipe” the mtcars dataset into head, which shows just the first few rows:

mtcars %>% head()
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

You can think of your data as flowing along lengths of pipe, joined by commands which do things to the data, step by step, until the result you want plops out at the end.

Each pipe should be read as the word “then”, e.g. “take the mtcars data, then head() it”.

The > in the pipe command reminds you of the direction in which your data is flowing (it only works left to right).

It’s important to know that the pipe command doesn’t store the results of these steps.

Sometimes that’s OK. In our first example we just wanted to look at the first few rows of the mtcars data.

But, you will usually want to save the result of a pipeline in a new variable.
For example, if we wanted to save the first few rows of the mtcars data to a new variable we would write:

mtcars_head <- mtcars %>% head()

Here we combine assignment with a pipeline.

The result of the pipeline (a data.frame containing the first few rows of mtcars) is saved to a new variable called mtcars_head.
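To make the link between pipes and ordinary commands concrete, here is a minimal sketch. It assumes the dplyr package (part of the tidyverse) is installed, and the variable names piped and direct are ours. It checks that piping mtcars into head() gives the same result as calling head(mtcars) directly.

```r
library(dplyr)  # provides the %>% pipe

# pipe mtcars into head() and store the result
piped <- mtcars %>% head()

# the same command written without a pipe
direct <- head(mtcars)

# compare the two results; identical() returns TRUE when they match exactly
identical(piped, direct)

# head() keeps the first six rows by default
nrow(piped)
```

Both forms do the same job. The pipe becomes more useful as the number of steps grows, because the steps can be read from left to right in the order they happen.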

You can explore your variables using the Environment pane. A data.frame will have an icon that looks like a spreadsheet. If you [click on the icon], the data.frame is displayed in a new tab in the Source pane.

This tab shows you the same information as printing the data.frame, such as the number of rows and columns, but it also provides tools for exploring the data interactively.

  • The arrows next to the column names allow you to arrange the rows in ascending or descending order based on the column values.
  • The Filter button allows you to specify a value for one or more columns to filter out non-matching rows. For example, we could display just cars with 4 gears. Click the button again to turn off the filter.

Exercise 4

  1. Create a level 3 markdown heading named “Exercise 4” in your workbook. (You should be used to doing this for every exercise by now, so we won’t remind you again.)
  2. Create a new chunk beneath the heading
  3. Load the tidyverse library
  4. Pipe the mpg data.frame into head() and assign the results to a variable called mpg_head
  5. Use the Environment pane to open mpg_head

In 1999, how many miles per gallon could a 6 cylinder, manual transmission Audi A4 cover when driven in the city?

Loading data from elsewhere

TODO: replace with video

Video summary:

  • Often we want to load data into R, rather than use built-in datasets.
  • The preferred format for data files in R is comma-separated value (CSV).
  • CSV data can be read using the read_csv() command.
  • You can load data from an internet address (URL) or a file uploaded to the server.

Loading data

In a lot of these sessions we use datasets that are built-in to R because it’s quick and convenient to illustrate the points we make.

[demo opening glancing some built in data like gapminder, iris, mtcars etc]

Normally, though, you will need to load your own data.

R can read data from two places:

  • A URL (web address), if the data file is available on the internet somewhere
  • A file on the computer that R is running on

The link below is a URL (web address) for a file containing data about US police shootings.

The final part of the url tells us the name of the file: shootings.csv

The final 3 (sometimes 4) letters of the filename are called the file extension.

Here the file extension is .csv, which stands for ‘comma separated values’ or CSV.

CSV is a common data type. Most data-oriented programmes (e.g. Excel or Open Office or SPSS) can read and write .csv files, so it’s a good choice for storing and sharing data.

If you click on the link [click link in vid] you’ll see the first line is a list of column names separated by commas.

The remaining lines contain rows of data matching the column headings. For example, the value of the arms_category column in row 1 is Guns.

The read_csv() command reads a CSV file, and converts it to a data.frame, which is the format we use in R.

We can use read_csv() to load data from either a file, or over the internet, which is shown in the next video.
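To see how the lines of a CSV file map onto a data.frame, here is a small self-contained sketch. It assumes the readr package (part of the tidyverse) is installed. The data, the column names and the variable names csv_path and people are all made up for illustration. The sketch writes three lines of text to a temporary file and then reads them back with read_csv().

```r
library(readr)

# create a temporary file to hold the example data
csv_path <- tempfile(fileext = ".csv")

# first line: column names; remaining lines: one row of data each
writeLines(
  c("name,age,city",
    "Ada,36,London",
    "Grace,45,New York"),
  csv_path
)

# read the file and store the resulting data.frame in a variable
people <- read_csv(csv_path)

# print the data
people
```

The first line of the file becomes the column names, and each remaining line becomes one row.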

Reading CSV files from the internet

TODO: replace with video

Video summary:

  • read_csv('http://...') can load data from a URL.
  • It converts the data to a data.frame.
  • You must assign the loaded data to a variable, which you should give a descriptive name.
  • Use the Environment pane to view data you load using read_csv().
# load data from a URL into a variable
shootings <- read_csv('https://benwhalley.github.io/lifesavR/lifesavr/shootings.csv')

# display data
shootings
# A tibble: 4,895 x 15
      id name   date       manner_of_death  armed   age gender race  city  state
   <dbl> <chr>  <date>     <chr>            <chr> <dbl> <chr>  <chr> <chr> <chr>
 1     3 Tim E… 2015-01-02 shot             gun      53 M      Asian Shel… WA   
 2     4 Lewis… 2015-01-02 shot             gun      47 M      White Aloha OR   
 3     5 John … 2015-01-03 shot and Tasered unar…    23 M      Hisp… Wich… KS   
 4     8 Matth… 2015-01-04 shot             toy …    32 M      White San … CA   
 5     9 Micha… 2015-01-04 shot             nail…    39 M      Hisp… Evans CO   
 6    11 Kenne… 2015-01-04 shot             gun      18 M      White Guth… OK   
 7    13 Kenne… 2015-01-05 shot             gun      22 M      Hisp… Chan… AZ   
 8    15 Brock… 2015-01-06 shot             gun      35 M      White Assa… KS   
 9    16 Autum… 2015-01-06 shot             unar…    34 F      White Burl… IA   
10    17 Lesli… 2015-01-06 shot             toy …    47 M      Black Knox… PA   
# … with 4,885 more rows, and 5 more variables: signs_of_mental_illness <lgl>,
#   threat_level <chr>, flee <chr>, body_camera <lgl>, arms_category <chr>

CSV files are a common format to store and share data. As shown in the previous video, the first line of a CSV file defines the column names, and the remaining lines are rows of data.

The read_csv() command reads a CSV file, and converts it to a data.frame, which is the format we use in R. We can load data either from a file, or over the internet.

In this example, I’m reading a CSV directly over the Internet and storing the resulting data.frame in the variable shootings.

The URL (the link to the CSV file) needs to be in quotes (single or double quotes both work).

shootings <- read_csv('https://benwhalley.github.io/lifesavR/lifesavr/shootings.csv')

Because we made a new variable, the result is stored in the Environment, and we can double-click it to have a look at the data.

An alternative (and recommended) way is to type the name of the variable as a very simple command:

shootings
# A tibble: 4,895 x 15
      id name   date       manner_of_death  armed   age gender race  city  state
   <dbl> <chr>  <date>     <chr>            <chr> <dbl> <chr>  <chr> <chr> <chr>
 1     3 Tim E… 2015-01-02 shot             gun      53 M      Asian Shel… WA   
 2     4 Lewis… 2015-01-02 shot             gun      47 M      White Aloha OR   
 3     5 John … 2015-01-03 shot and Tasered unar…    23 M      Hisp… Wich… KS   
 4     8 Matth… 2015-01-04 shot             toy …    32 M      White San … CA   
 5     9 Micha… 2015-01-04 shot             nail…    39 M      Hisp… Evans CO   
 6    11 Kenne… 2015-01-04 shot             gun      18 M      White Guth… OK   
 7    13 Kenne… 2015-01-05 shot             gun      22 M      Hisp… Chan… AZ   
 8    15 Brock… 2015-01-06 shot             gun      35 M      White Assa… KS   
 9    16 Autum… 2015-01-06 shot             unar…    34 F      White Burl… IA   
10    17 Lesli… 2015-01-06 shot             toy …    47 M      Black Knox… PA   
# … with 4,885 more rows, and 5 more variables: signs_of_mental_illness <lgl>,
#   threat_level <chr>, flee <chr>, body_camera <lgl>, arms_category <chr>
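Another quick way to inspect a data.frame is glimpse(), which prints one line per column with its type and first few values. The sketch below uses the built-in mtcars data so that it runs without an internet connection. It assumes the dplyr package (part of the tidyverse) is installed.

```r
library(dplyr)

# one line per column: name, type, and the first few values
mtcars %>% glimpse()

# the number of rows and columns, if that is all you need
nrow(mtcars)
ncol(mtcars)
```

This view is handy for wide datasets like shootings, where printing the data hides some columns.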

Exercise 5

  1. Create a new chunk.
  2. Read the data stored at https://benwhalley.github.io/lifesavR/lifesavr/shootings.csv
  3. View it using the Environment pane.
  4. View it using glimpse().

Using data from your computer

TODO: replace with video

Video summary:

  • Before you can use data from your computer, you must upload it to the server.
  • Data can be uploaded using the Files pane.
  • Always upload data to the same location as your R code.
  • For data you upload, give read_csv() the path to the CSV file.
  • You must assign the loaded data to a variable, which you should give a descriptive name.
  • Use the Environment pane to view the data.

The Upload button in the Files pane lets you upload a file from your computer to RStudio. RStudio uses file extensions to guess what the file contains. A file extension is a sequence of characters, starting with a ., at the end of a file name.

  • .csv - CSV file
  • .rmd - R Markdown file

Make sure that any file you upload has the correct file extension.

We’ll upload shootings.csv from the previous exercise.

  1. Click the Upload button.
  2. Ensure the Target directory is where you want the uploaded file to appear. For this module it should read ~/lifesavr. The ~ (pronounced “tilde”) means your Home directory on the RStudio server. The /lifesavr means the folder named lifesavr in Home.
  3. Click the Choose file button and select the file you want to upload. After you select a file, its name appears next to the button.
  4. Click the OK button.

The file should appear in the Files pane in your lifesavr folder.
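A path is just the folder and the file name joined by /. As a rough sketch, and assuming the file was uploaded to ~/lifesavr as described above, the base R commands file.path() and file.exists() can build the path and check that the file is really there before you try to read it. The variable name csv_path is ours.

```r
# join the folder and file name into one path
csv_path <- file.path("~", "lifesavr", "shootings.csv")
csv_path

# expand ~ to see the full location of your Home directory
path.expand("~")

# TRUE if the file is where we expect it, FALSE if not
file.exists(csv_path)
```

If file.exists() returns FALSE, check the Target directory you chose when uploading and the spelling of the file name.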

Exercise 6

  1. Use your web browser to download https://benwhalley.github.io/lifesavR/lifesavr/shootings.csv to your computer.
  2. Upload shootings.csv to the server.
  3. Create a new chunk.
  4. Read shootings.csv into a variable with a descriptive name.

In which city was the earliest recorded shooting?

Selecting rows with filter()

TODO: replace with video

Video summary:

  • The filter() command selects rows from a dataset which match criteria we set.
  • The simplest filter uses == (equals equals), to test if the row is an exact match.
  • We can use other filters like < or > to match criteria in numeric columns.
  • We can combine multiple filters to get exactly the rows we need.
# load gapminder dataset
library(gapminder)

# filter rows where country is equal to the word "Kenya"
# remember to use double equals (==) rather than single (=)
gapminder %>% 
  filter(country == "Kenya")
# A tibble: 12 x 6
   country continent  year lifeExp      pop gdpPercap
   <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
 1 Kenya   Africa     1952    42.3  6464046      854.
 2 Kenya   Africa     1957    44.7  7454779      944.
 3 Kenya   Africa     1962    47.9  8678557      897.
 4 Kenya   Africa     1967    50.7 10191512     1057.
 5 Kenya   Africa     1972    53.6 12044785     1222.
 6 Kenya   Africa     1977    56.2 14500404     1268.
 7 Kenya   Africa     1982    58.8 17661452     1348.
 8 Kenya   Africa     1987    59.3 21198082     1362.
 9 Kenya   Africa     1992    59.3 25020539     1342.
10 Kenya   Africa     1997    54.4 28263827     1360.
11 Kenya   Africa     2002    51.0 31386842     1288.
12 Kenya   Africa     2007    54.1 35610177     1463.

# select rows where year is greater than 2000
gapminder %>% 
  filter(year > 2000)
# A tibble: 284 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       2002    42.1 25268405      727.
 2 Afghanistan Asia       2007    43.8 31889923      975.
 3 Albania     Europe     2002    75.7  3508512     4604.
 4 Albania     Europe     2007    76.4  3600523     5937.
 5 Algeria     Africa     2002    71.0 31287142     5288.
 6 Algeria     Africa     2007    72.3 33333216     6223.
 7 Angola      Africa     2002    41.0 10866106     2773.
 8 Angola      Africa     2007    42.7 12420476     4797.
 9 Argentina   Americas   2002    74.3 38331121     8798.
10 Argentina   Americas   2007    75.3 40301927    12779.
# … with 274 more rows

# select rows with low life expectancy
gapminder %>% 
  filter(lifeExp < 35)
# A tibble: 33 x 6
   country      continent  year lifeExp      pop gdpPercap
   <fct>        <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan  Asia       1952    28.8  8425333      779.
 2 Afghanistan  Asia       1957    30.3  9240934      821.
 3 Afghanistan  Asia       1962    32.0 10267083      853.
 4 Afghanistan  Asia       1967    34.0 11537966      836.
 5 Angola       Africa     1952    30.0  4232095     3521.
 6 Angola       Africa     1957    32.0  4561361     3828.
 7 Angola       Africa     1962    34    4826015     4269.
 8 Burkina Faso Africa     1952    32.0  4469979      543.
 9 Burkina Faso Africa     1957    34.9  4713416      617.
10 Cambodia     Asia       1977    31.2  6978607      525.
# … with 23 more rows

# combine multiple filters
gapminder::gapminder %>% 
  filter(country=="Kenya") %>% 
  filter(year > 2000) %>% 
  filter(lifeExp < 55)
# A tibble: 2 x 6
  country continent  year lifeExp      pop gdpPercap
  <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
1 Kenya   Africa     2002    51.0 31386842     1288.
2 Kenya   Africa     2007    54.1 35610177     1463.

The following chunk filters the gapminder dataset to include only rows where the country column equals “Kenya”.

library(gapminder)
gapminder %>% filter(country == "Kenya")
# A tibble: 12 x 6
   country continent  year lifeExp      pop gdpPercap
   <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
 1 Kenya   Africa     1952    42.3  6464046      854.
 2 Kenya   Africa     1957    44.7  7454779      944.
 3 Kenya   Africa     1962    47.9  8678557      897.
 4 Kenya   Africa     1967    50.7 10191512     1057.
 5 Kenya   Africa     1972    53.6 12044785     1222.
 6 Kenya   Africa     1977    56.2 14500404     1268.
 7 Kenya   Africa     1982    58.8 17661452     1348.
 8 Kenya   Africa     1987    59.3 21198082     1362.
 9 Kenya   Africa     1992    59.3 25020539     1342.
10 Kenya   Africa     1997    54.4 28263827     1360.
11 Kenya   Africa     2002    51.0 31386842     1288.
12 Kenya   Africa     2007    54.1 35610177     1463.

The == is called an “operator”. It compares values from the column on the left hand side with the value specified on the right hand side. The value must match the column type. The value "Kenya" is in quotes because the country column holds text labels (it is stored as a factor), whereas numbers are written without quotes.

The “greater than” operator > filters numeric data.

gapminder %>% filter(year > 2000)
# A tibble: 284 x 6
   country     continent  year lifeExp      pop gdpPercap
   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
 1 Afghanistan Asia       2002    42.1 25268405      727.
 2 Afghanistan Asia       2007    43.8 31889923      975.
 3 Albania     Europe     2002    75.7  3508512     4604.
 4 Albania     Europe     2007    76.4  3600523     5937.
 5 Algeria     Africa     2002    71.0 31287142     5288.
 6 Algeria     Africa     2007    72.3 33333216     6223.
 7 Angola      Africa     2002    41.0 10866106     2773.
 8 Angola      Africa     2007    42.7 12420476     4797.
 9 Argentina   Americas   2002    74.3 38331121     8798.
10 Argentina   Americas   2007    75.3 40301927    12779.
# … with 274 more rows

This chunk filters rows where year is greater than 2000.

The opposite of the > operator is the < operator. This keeps only the rows where a numeric column is less than a value.
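As a minimal sketch of the “less than” operator, the chunk below uses the built-in mtcars data rather than gapminder, and assumes the dplyr package is installed. The variable name thirsty_cars and the cut-off of 15 are our own choices for illustration.

```r
library(dplyr)

# keep only the cars that do fewer than 15 miles per gallon
thirsty_cars <- mtcars %>%
  filter(mpg < 15)

# print the rows that were kept
thirsty_cars

# every remaining row should satisfy the condition
all(thirsty_cars$mpg < 15)
```

Note that a car with exactly 15 miles per gallon is not included, because 15 is not less than 15.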

Combined filters

gapminder::gapminder %>% 
  filter(country=="Kenya") %>% 
  filter(year > 2000)
# A tibble: 2 x 6
  country continent  year lifeExp      pop gdpPercap
  <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
1 Kenya   Africa     2002    51.0 31386842     1288.
2 Kenya   Africa     2007    54.1 35610177     1463.
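Chaining filters with the pipe is one way to combine criteria. In dplyr, several criteria can also be given to a single filter(), separated by commas, and a row is kept only if it matches all of them. The sketch below uses the built-in mtcars data and assumes dplyr is installed. The variable names chained and combined are ours.

```r
library(dplyr)

# two filters, one after the other
chained <- mtcars %>%
  filter(cyl == 4) %>%
  filter(mpg > 30)

# both conditions in one filter(); rows must match all of them
combined <- mtcars %>%
  filter(cyl == 4, mpg > 30)

# check that the two approaches keep the same rows
all.equal(chained, combined)
```

Which form you use is a matter of taste. Chained filters read as a sequence of “then” steps, while the single filter() is shorter.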

Exercise 7

Filter gapminder to show countries with a population greater than 100 million.

The results should look like this:

# A tibble: 77 x 6
   country    continent  year lifeExp       pop gdpPercap
   <fct>      <fct>     <int>   <dbl>     <int>     <dbl>
 1 Bangladesh Asia       1987    52.8 103764241      752.
 2 Bangladesh Asia       1992    56.0 113704579      838.
 3 Bangladesh Asia       1997    59.4 123315288      973.
 4 Bangladesh Asia       2002    62.0 135656790     1136.
 5 Bangladesh Asia       2007    64.1 150448339     1391.
 6 Brazil     Americas   1972    59.5 100840058     4986.
 7 Brazil     Americas   1977    61.5 114313951     6660.
 8 Brazil     Americas   1982    63.3 128962939     7031.
 9 Brazil     Americas   1987    65.2 142938076     7807.
10 Brazil     Americas   1992    67.1 155975974     6950.
# … with 67 more rows

Exercise 8

Show countries with a population greater than 100 million and life expectancy greater than 70.

The results should look like this:

# A tibble: 27 x 6
   country   continent  year lifeExp        pop gdpPercap
   <fct>     <fct>     <int>   <dbl>      <int>     <dbl>
 1 Brazil    Americas   2002    71.0  179914212     8131.
 2 Brazil    Americas   2007    72.4  190010647     9066.
 3 China     Asia       1997    70.4 1230075000     2289.
 4 China     Asia       2002    72.0 1280400000     3119.
 5 China     Asia       2007    73.0 1318683096     4959.
 6 Indonesia Asia       2007    70.6  223547000     3541.
 7 Japan     Asia       1967    71.4  100825279     9848.
 8 Japan     Asia       1972    73.4  107188273    14779.
 9 Japan     Asia       1977    75.4  113872473    16610.
10 Japan     Asia       1982    77.1  118454974    19384.
# … with 17 more rows

Sorting data using arrange()

remind them they know how to make scatter and boxplots

  • “What is the size of the largest diamond (by carat) in the diamonds dataset?”
  • “What cut were the three largest diamonds in that dataset?”

TODO: replace with video

diamonds %>% arrange(-carat) %>% head(3)
# A tibble: 3 x 10
  carat cut   color clarity depth table price     x     y     z
  <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  5.01 Fair  J     I1       65.5    59 18018  10.7 10.5   6.98
2  4.5  Fair  J     I1       65.8    58 18531  10.2 10.2   6.72
3  4.13 Fair  H     I1       64.8    61 17329  10    9.85  6.43
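arrange() sorts from smallest to largest by default, and a minus sign in front of a numeric column reverses the order, as in the diamonds example. A minimal sketch on the built-in mtcars data (assuming dplyr is installed) shows both directions, plus desc(), a dplyr helper that gives the same reversed order. The variable names by_minus and by_desc are ours.

```r
library(dplyr)

# smallest mpg first (the default)
mtcars %>% arrange(mpg) %>% head(3)

# largest mpg first, using a minus sign
by_minus <- mtcars %>% arrange(-mpg) %>% head(3)

# largest mpg first, using desc()
by_desc <- mtcars %>% arrange(desc(mpg)) %>% head(3)

# both reversed versions put the same cars on top
all.equal(by_minus, by_desc)
```

The minus sign only works for numbers, so desc() is the safer choice if you ever need to reverse-sort a text column.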

Combining filtering and sorting

TODO: replace with video

In which year did Kenyans have the lowest life expectancy?

gapminder::gapminder %>% filter(country=="Kenya") %>% 
  arrange(lifeExp) %>% 
  head(6)
# A tibble: 6 x 6
  country continent  year lifeExp      pop gdpPercap
  <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
1 Kenya   Africa     1952    42.3  6464046      854.
2 Kenya   Africa     1957    44.7  7454779      944.
3 Kenya   Africa     1962    47.9  8678557      897.
4 Kenya   Africa     1967    50.7 10191512     1057.
5 Kenya   Africa     2002    51.0 31386842     1288.
6 Kenya   Africa     1972    53.6 12044785     1222.

Which year had the highest life expectancy? The key change is the minus sign, which reverses the sort order.

gapminder::gapminder %>% 
  filter(country=="Kenya") %>% 
  arrange(-lifeExp) 
# A tibble: 12 x 6
   country continent  year lifeExp      pop gdpPercap
   <fct>   <fct>     <int>   <dbl>    <int>     <dbl>
 1 Kenya   Africa     1987    59.3 21198082     1362.
 2 Kenya   Africa     1992    59.3 25020539     1342.
 3 Kenya   Africa     1982    58.8 17661452     1348.
 4 Kenya   Africa     1977    56.2 14500404     1268.
 5 Kenya   Africa     1997    54.4 28263827     1360.
 6 Kenya   Africa     2007    54.1 35610177     1463.
 7 Kenya   Africa     1972    53.6 12044785     1222.
 8 Kenya   Africa     2002    51.0 31386842     1288.
 9 Kenya   Africa     1967    50.7 10191512     1057.
10 Kenya   Africa     1962    47.9  8678557      897.
11 Kenya   Africa     1957    44.7  7454779      944.
12 Kenya   Africa     1952    42.3  6464046      854.

Combining rows using summarise()

TODO: replace with video

  • Often you have lots of data and need to make summaries of it — e.g. to calculate the average of a column
  • The summarise() function takes many rows and uses a function to convert those into fewer rows.
  • We can use many different functions with summarise, but common choices are functions for descriptive statistics, like mean, median, or sd (short for standard deviation)
mtcars %>% summarise(average_mpg = mean(mpg))
  average_mpg
1    20.09062
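summarise() can also calculate several statistics in one go. The sketch below is one way to do that, assuming dplyr is installed. The column names on the left of each = and the variable name mpg_summary are our own choices, and n() is a dplyr helper that counts the rows being summarised.

```r
library(dplyr)

# calculate several descriptive statistics for mpg at once
mpg_summary <- mtcars %>%
  summarise(
    average_mpg = mean(mpg),    # mean
    median_mpg  = median(mpg),  # median
    sd_mpg      = sd(mpg),      # standard deviation
    n_cars      = n()           # number of rows summarised
  )

# print the one-row summary
mpg_summary
```

The result is still a data.frame, with one row and one column per statistic.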

Using filter() and summarise() together

  • Using the pipe (%>%), we can combine multiple steps
  • It’s common to filter out certain rows before using summarise()
mtcars %>% 
  filter(am==1) %>% 
  summarise(mean(mpg))
  mean(mpg)
1  24.39231

Grouping data with group_by

TODO: replace with video

  • In our data we may have categorical variables (e.g. gender, or country)
  • We often want to compute summaries for each group
  • Using filter(), we could make a summary for each group, one by one; the group_by function does this for us
  • If you add group_by() to a pipeline then all the subsequent steps are run once for each group
  • Be careful only to group by categorical variables

We might make a plot like this:

mtcars %>% 
  ggplot(aes(factor(cyl), mpg)) + 
  geom_boxplot()

But what if we want these numbers in a table (or to include in a report)? We can do that using group_by and summarise…

mtcars %>% 
  group_by(cyl) %>% 
  summarise(average_mpg = mean(mpg))
# A tibble: 3 x 2
    cyl average_mpg
* <dbl>       <dbl>
1     4        26.7
2     6        19.7
3     8        15.1

We can also group by two variables at once and get a row for each combination:

mtcars %>% group_by(cyl, am) %>% summarise(mean(mpg))
# A tibble: 6 x 3
# Groups:   cyl [3]
    cyl    am `mean(mpg)`
  <dbl> <dbl>       <dbl>
1     4     0        22.9
2     4     1        28.1
3     6     0        19.1
4     6     1        20.6
5     8     0        15.0
6     8     1        15.4
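To see what group_by() saves us from doing by hand, the sketch below calculates the average for one group using filter(), and then for every group using group_by(). It assumes dplyr is installed. The variable names four_cyl and by_cyl are ours, and n() is added to count the cars in each group.

```r
library(dplyr)

# one group at a time, using filter()
four_cyl <- mtcars %>%
  filter(cyl == 4) %>%
  summarise(average_mpg = mean(mpg))

# all groups at once, using group_by(); n() counts the rows in each group
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(average_mpg = mean(mpg), n_cars = n())

# print both results to compare them
four_cyl
by_cyl
```

The value in four_cyl should match the row for 4 cylinders in by_cyl. With filter() we would need to repeat the first pipeline once for each number of cylinders.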

Check your knowledge

Write an answer to each of these questions in the Check your knowledge section of your workbook. The answers will be revealed in Session 3.

  • What is the %>% symbol called and what does it do?
  • What is the <- symbol called and what does it do?

Practice problems

Additional questions

  • In the gapminder dataset, what country had the highest life expectancy in 1952? (Use arrange, filter and head)
gapminder::gapminder %>% 
  filter(year == 1952) %>% 
  arrange(-lifeExp) %>% 
  head(1)
# A tibble: 1 x 6
  country continent  year lifeExp     pop gdpPercap
  <fct>   <fct>     <int>   <dbl>   <int>     <dbl>
1 Norway  Europe     1952    72.7 3327728    10095.
  • Which continent had the highest average GDP per capita across all years in the data? (Use arrange, group_by, and summarise)
gapminder::gapminder %>% 
  group_by(continent) %>% 
  summarise(average_gdp = mean(gdpPercap)) %>% 
  arrange(-average_gdp)
# A tibble: 5 x 2
  continent average_gdp
  <fct>           <dbl>
1 Oceania        18622.
2 Europe         14469.
3 Asia            7902.
4 Americas        7136.
5 Africa          2194.
  • Make a boxplot showing life expectancy by continent for years after 2000. (Use filter, ggplot and geom_boxplot)
gapminder::gapminder %>% 
  filter(year > 2000) %>% 
  ggplot(aes(continent, lifeExp)) + 
  geom_boxplot()

“Mega problem”

Describe these as the ‘end of level boss characters’. You need to combine all your skills to beat them…

Make a table which shows the average life expectancy for each continent, sorted from highest to lowest:

gapminder::gapminder %>% 
  group_by(continent) %>% 
  summarise(life_expectancy = mean(lifeExp)) %>% 
  arrange(-life_expectancy)
# A tibble: 5 x 2
  continent life_expectancy
  <fct>               <dbl>
1 Oceania              74.3
2 Europe               71.9
3 Americas             64.7
4 Asia                 60.1
5 Africa               48.9

Broken script to fix

  • Fix a ‘broken’ script: Start a NEW R session and make this code work:
liibrary(todyverse)

# make a density plot of of life expectacy with different color lines for each continent
gapminder::gapminder %>% 
  ggplote(aes("lifeExp", colr = "Continent"))  geom_density()

# select only years after 1990
gapminder::gapminder %>% 
  filter(year > 1990)

ggplot(aes(year, lifeExp, color=continent)) + 
  geom_jitter()
  

NOTE - we will know all the errors they will see so can provide hints for each of them

Correct version would be:

library(tidyverse)

# make a density plot of life expectancy with different color lines for each continent
gapminder::gapminder %>% 
  ggplot(aes(lifeExp, color = continent))  + 
  geom_density()


# select only years after 1990
gapminder::gapminder %>% 
  filter(year > 1990) %>%
  ggplot(aes(year, lifeExp, color=continent)) + 
  geom_jitter()